Probability¶
Probability measures the likelihood of an event as a value between 0 (impossible) and 1 (certain).
Basic Concepts¶
- Sample Space
  - Definition: The set of all possible outcomes of a random experiment.
  - Types of sample spaces: finite, infinite, discrete, continuous.
- Events
  - Definition: A set of outcomes of a random experiment, i.e. a subset of the sample space.
  - Types of events: simple events (1 outcome) vs. compound events (2+ outcomes); mutually exclusive events vs. complementary events.
  - Complementary Events: $P(A') = 1 - P(A)$
  - Mutually Exclusive Events: the occurrence of one event prevents the other from occurring, i.e. $P(A \cap B) = 0$.
- Independence and Conditional Probability
  - Independent: Two events are independent if the occurrence of one doesn't change the probability of the other, i.e. $P(A \mid B) = P(A)$
  - Conditional Probability: The probability of event A given that event B has occurred (with $P(B) > 0$), i.e. $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
- Basic Probability Rules
  - Addition Rule (at least 1 event occurs):
    - For general events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
    - For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
  - Multiplication Rule (both events occur):
    - For dependent events (conditional probability): $P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$
    - For independent events: $P(A \cap B) = P(A) \cdot P(B)$
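The rules above can be checked by brute-force counting on a small sample space. The following is a minimal sketch, assuming two fair six-sided dice; the events `A`, `B`, `C` and the helper `prob` are hypothetical names chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice (assumed).
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Hypothetical events, chosen only for illustration.
A = {s for s in sample_space if s[0] == 6}          # first die shows 6
B = {s for s in sample_space if s[0] + s[1] >= 10}  # sum is at least 10
C = {s for s in sample_space if s[1] % 2 == 0}      # second die is even

# Complement rule: P(A') = 1 - P(A)
assert prob(set(sample_space) - A) == 1 - prob(A)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Conditional probability and the multiplication rule for dependent events
p_A_given_B = prob(A & B) / prob(B)
p_B_given_A = prob(A & B) / prob(A)
assert prob(A & B) == prob(A) * p_B_given_A == prob(B) * p_A_given_B

# A and C are independent: P(A | C) = P(A) and P(A and C) = P(A) * P(C)
p_A_given_C = prob(A & C) / prob(C)
assert p_A_given_C == prob(A)
assert prob(A & C) == prob(A) * prob(C)

print("P(A) =", prob(A), " P(A | B) =", p_A_given_B, " P(A | C) =", p_A_given_C)
```

Using `Fraction` keeps every probability exact, so the equalities can be tested with `==` instead of a floating-point tolerance.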
Random Variables (RV)¶
A random variable (RV) is a function that assigns a numeric value to each outcome in a sample space.
Example 1.2.1¶
Suppose we toss a coin twice and record the sequence of heads (ℎ) and tails (𝑡). The sample space is 𝑆 = {ℎℎ,ℎ𝑡,𝑡ℎ,𝑡𝑡}.
We can define a random variable 𝑋 that tracks the number of heads obtained in an outcome. Formally, we denote this as follows:

$$\begin{aligned} X: S &\rightarrow \mathbb{R} \\ s &\mapsto\ \text{number of}\ h\text{'s in}\ s \end{aligned}$$

We can list the value of 𝑋 for each outcome individually:

\begin{align*} \text{inputs:}\ S\ &\stackrel{\text{function:}\ X}{\longrightarrow}\ \text{outputs:}\ \mathbb{R} \\ hh &\quad\stackrel{X}{\mapsto}\quad 2 \\ th &\quad\stackrel{X}{\mapsto}\quad 1 \\ ht &\quad\stackrel{X}{\mapsto}\quad 1 \\ tt &\quad\stackrel{X}{\mapsto}\quad 0 \end{align*}

We can also write this as follows: $X(hh)=2,\ X(ht)=X(th)=1,\ X(tt)=0.$
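The mapping in this example can be written directly as a function on the sample space. This is a small sketch; the names `S`, `X` and `values` mirror the notation above and are otherwise arbitrary.

```python
from itertools import product

# Sample space of two coin tosses, encoded as strings of 'h' and 't'.
S = ["".join(tosses) for tosses in product("ht", repeat=2)]

def X(s):
    """Random variable X: the number of heads in outcome s."""
    return s.count("h")

# Evaluate X on every outcome of the sample space.
values = {s: X(s) for s in S}
print(values)
```

The random variable is nothing more than an ordinary function; the randomness comes from which outcome of `S` the experiment produces.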
Discrete Random Variables¶
A discrete RV is a random variable that has only a finite or countably infinite (e.g. the integers) number of possible values, such as the number of heads in a series of coin tosses.
Probability Mass Functions (PMFs)¶
- The Probability Mass Function (or frequency function) gives the probability that a discrete random variable equals a specific value. More specifically, let 𝑋 be a discrete random variable with possible values $x_1, x_2, \ldots, x_i, \ldots$; then the probability mass function of 𝑋, denoted 𝑝, is given by $$p(x_i) = P(X = x_i)$$
- Properties of PMF:
  - $\sum_{x_i} p(x_i) = p(x_1) + p(x_2) + \cdots = 1$
  - $p(x_i) \geq 0$, for all $x_i$
  - If 𝐴 is a subset of the possible values of 𝑋, then the probability that 𝑋 takes a value in 𝐴 is given by $$P(X\in A) = \sum_{x_i\in A} p(x_i)$$
- We can represent a PMF numerically with a table, graphically with a histogram, or analytically with a formula.
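As an illustration of the three properties, the PMF of the number of heads in two coin tosses can be built by counting outcomes. This sketch assumes a fair coin, so all four outcomes are equally likely; `pmf` and `prob_in` are hypothetical names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All outcomes of two tosses; assumed equally likely (fair coin).
outcomes = ["".join(t) for t in product("ht", repeat=2)]

# Count how many outcomes give each number of heads, then normalize.
counts = Counter(s.count("h") for s in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# Property 1: the probabilities sum to 1.
assert sum(pmf.values()) == 1
# Property 2: every probability is non-negative.
assert all(p >= 0 for p in pmf.values())

def prob_in(A):
    """Property 3: P(X in A) is the sum of p(x_i) over the x_i in A."""
    return sum(p for x, p in pmf.items() if x in A)

p_at_least_one_head = prob_in({1, 2})
print(pmf, p_at_least_one_head)
```

The dictionary is the "table" representation of the PMF mentioned above, with one entry per possible value.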
Example 1.2.2¶
[Figure: histogram of the probability mass function of 𝑋]
In the histogram, each rectangle has width 1 and height equal to the probability of the corresponding value of the random variable 𝑋. For example, the leftmost rectangle in the histogram is centered at 0 and has height equal to 𝑝(0)=0.25, which is also the area of the rectangle since the width is equal to 1. In this way, histograms provide a visualization of how probability is distributed over the possible values of the random variable 𝑋.
Cumulative Distribution Functions (CDFs)¶
- The cumulative distribution function of a random variable 𝑋 is a function on the real numbers that is denoted as 𝐹 and is given by $$F(x) = P(X\leq x),\quad \text{for any}\ x\in\mathbb{R}.$$
- CDFs are also defined for continuous random variables in exactly the same way.
- The CDF of a random variable is defined for all real numbers, unlike the PMF of a discrete random variable which we only define for the possible values of the random variable.
- The CDF of a discrete RV 𝑋 can be calculated using the 3rd property of PMF: for any $x \in \mathbb{R}$, let the set $A = \{x_i \mid x_i \leq x\}$ contain the possible values of 𝑋 that are less than or equal to 𝑥. Then the CDF of 𝑋 evaluated at 𝑥 is given by $$F(x) = P(X\leq x) = P(X\in A) = \sum_{x_i\leq x} p(x_i).$$
- Properties of CDF:
  - 𝐹 is non-decreasing, i.e., $F(a) \leq F(b)$ whenever $a \leq b$; 𝐹 may be constant on an interval, but it never decreases.
  - $\displaystyle{\lim_{x\to-\infty} F(x) = 0}\ $ and $\displaystyle{\lim_{x\to\infty} F(x) = 1}$
Example 1.2.3¶
Using the PMF $p(0)=0.25$, $p(1)=0.5$, $p(2)=0.25$, the CDF of 𝑋 is

$$F(x) = \left\{\begin{array}{l l} 0, & \text{for}\ x<0 \\ 0.25, & \text{for}\ 0\leq x <1 \\ 0.75, & \text{for}\ 1\leq x <2 \\ 1, & \text{for}\ x\geq 2. \end{array}\right.$$

For example:

$$\begin{aligned} F(-3) &= P(X\leq -3) = 0 \\ F(0.9) &= P(X\leq 0.9) = P(X=0) = 0.25 \\ F(1.4) &= P(X\leq 1.4) = \sum_{x_i\leq 1.4} p(x_i) = p(0) + p(1) = 0.25 + 0.5 = 0.75 \\ F(2.3) &= P(X\leq 2.3) = p(0) + p(1) + p(2) = 0.25 + 0.5 + 0.25 = 1 \\ F(18) &= P(X\leq 18) = P(X\leq 2) = 1 \end{aligned}$$
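The evaluations in this example follow mechanically from summing the PMF over all values up to 𝑥. A minimal sketch, assuming the PMF values used above; the function name `cdf` is hypothetical.

```python
# PMF of X taken from the example: p(0) = 0.25, p(1) = 0.5, p(2) = 0.25.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def cdf(x):
    """F(x) = P(X <= x): sum of p(x_i) over all possible values x_i <= x."""
    return sum(p for x_i, p in pmf.items() if x_i <= x)

# Evaluate F at the same points as in the example.
for x in (-3, 0.9, 1.4, 2.3, 18):
    print(f"F({x}) = {cdf(x)}")
```

Note that `cdf` accepts any real number, while `pmf` only has entries for the possible values of 𝑋, which matches the remark that the CDF is defined on all of ℝ.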
Expected Values (EV) and Variance¶
- EV: the mean of random variable 𝑋 (a probability-weighted average), a measure of the center of 𝑋, or the long-run average value of 𝑋 (imagine repeating the random experiment many times) $$\text{E}[X] = \sum_{i} x_i\, p(x_i)$$
- Variance: the mean squared deviation of 𝑋 from its EV $$\text{Var}[X] = \text{E}[(X - \text{E}[X])^2] = \text{E}[X^2] - (\text{E}[X])^2 = \sum_{i} (x_i - \text{E}[X])^2\, p(x_i)$$
  - $(X−\text{E}[X])$ : The deviation of 𝑋 from its mean (EV).
  - $(X−\text{E}[X])^2$ : The squared deviation, which eliminates negative signs and emphasizes larger deviations.
  - $\text{E}[(X - \text{E}[X])^2]$ : The expected value of the squared deviations, i.e. the mean squared deviation of 𝑋 from its mean.
- Linearity of EV: In many cases, we are not interested in the value of a random variable itself, but rather in a function of the random variable or of a collection of random variables. For a linear function $g(X) = aX + b$, $$\text{E}[g(X)] = \text{E}[aX + b] = a\text{E}[X] + b$$ and, for two random variables 𝑋 and 𝑌, $$\text{E}[aX + bY] = a\text{E}[X] + b\text{E}[Y]$$
- Variance of linear functions: $$\text{Var}(aX+b) = a^2\text{Var}(X)$$ and $$\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) \quad \text{(if X and Y are independent)}$$
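The identities for $aX + b$ can be verified exactly on a small PMF. The sketch below assumes the two-coin-toss PMF from the earlier examples and arbitrary constants `a = 3`, `b = 5`; the helpers `expectation` and `variance` are hypothetical names.

```python
from fractions import Fraction

# PMF of the number of heads in two fair coin tosses (assumed for illustration).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x_i) * p(x_i)."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf, g=lambda x: x):
    """Var[g(X)] = E[(g(X) - E[g(X)])^2]."""
    mean = expectation(pmf, g)
    return expectation(pmf, lambda x: (g(x) - mean) ** 2)

a, b = 3, 5  # arbitrary constants

def linear(x):
    return a * x + b

# E[aX + b] = a E[X] + b
assert expectation(pmf, linear) == a * expectation(pmf) + b
# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a is squared.
assert variance(pmf, linear) == a**2 * variance(pmf)

print(expectation(pmf), variance(pmf), expectation(pmf, linear), variance(pmf, linear))
```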
Example¶
Let 𝑋 be a value drawn uniformly at random from the list {1, 2, 2, 3, 4, 5, 6}, so the value 2 is twice as likely as each of the other values.
Probabilities:
- $P(X = 1) = 1/7$
- $P(X = 2) = 2/7$
- $P(X = 3) = 1/7$
- $P(X = 4) = 1/7$
- $P(X = 5) = 1/7$
- $P(X = 6) = 1/7$
Expected Value: $\text{E}[X] = (1 \times \frac{1}{7}) + (2 \times \frac{2}{7}) + (3 \times \frac{1}{7}) + (4 \times \frac{1}{7}) + (5 \times \frac{1}{7}) + (6 \times \frac{1}{7}) = \frac{23}{7}$
Variance: $\text{Var}[X] = \text{E}[X^2] - (\text{E}[X])^2 = \left(1^2 \times \frac{1}{7} + 2^2 \times \frac{2}{7} + 3^2 \times \frac{1}{7} + 4^2 \times \frac{1}{7} + 5^2 \times \frac{1}{7} + 6^2 \times \frac{1}{7}\right) - \left(\frac{23}{7}\right)^2 = \frac{95}{7} - \frac{529}{49} = \frac{136}{49}$
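The arithmetic in this example can be reproduced with exact fractions. A short sketch; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

values = [1, 2, 2, 3, 4, 5, 6]

# Each list entry is equally likely, so p(x) = (count of x) / 7.
pmf = {x: Fraction(c, len(values)) for x, c in Counter(values).items()}

mean = sum(x * p for x, p in pmf.items())              # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
var_shortcut = second_moment - mean**2                 # E[X^2] - (E[X])^2
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Both variance formulas agree.
assert var_shortcut == var_definition
print(mean, second_moment, var_shortcut)
```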
Continuous Random Variables¶
A continuous RV is a random variable that can take any value in an interval of real numbers (e.g. [0,1]), so it has uncountably many possible values. Examples are height and time.
Probability Density Functions (PDF)¶
The probability density function of a continuous RV, denoted as 𝑓, describes how probability is distributed over the real numbers; probabilities are obtained by integrating 𝑓. It satisfies the following:
- $f(x) \geq 0, \text{for all } x\in\mathbb{R}.$
- 𝑓 is piecewise continuous.
- $\displaystyle{\int\limits^{\infty}_{-\infty}\! f(x)\,dx = 1}$
- $\displaystyle{P(a\leq X\leq b) = \int\limits^b_a\! f(x)\,dx}$
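These properties can be checked numerically for a concrete density. The sketch below assumes a hypothetical density $f(x) = 2x$ on $[0, 1]$ (zero elsewhere) and uses a simple midpoint rule; neither choice comes from the text above.

```python
def f(x):
    """Hypothetical density for illustration: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(g, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))

# f vanishes outside [0, 1], so integrating over [0, 1] covers the whole real line.
total = integrate(f, 0.0, 1.0)

# P(0.25 <= X <= 0.5) is the area under f between 0.25 and 0.5.
p = integrate(f, 0.25, 0.5)

print(total, p)
```

For this density the exact answers are 1 and $0.5^2 - 0.25^2 = 0.1875$, so the printed values should agree with them up to floating-point error. For a continuous RV the probability of any single point is 0, which is why $P(a \leq X \leq b)$ and $P(a < X < b)$ coincide.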
Expected Values (EV) and Variance¶
- EV: the mean of random variable 𝑋 $$\mu = \mu_X = \text{E}[X] = \int_{-\infty}^{\infty} x f(x)\, dx$$
- Variance: the mean squared deviation of 𝑋 from its EV $$\text{Var}(X) = \text{E}[(X - \mu)^2] = \text{E}[X^2] - (\text{E}[X])^2 = \left(\int_{-\infty}^\infty x^2 f(x)\, dx\right) - \mu^2$$
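As a worked instance of both integrals, the sketch below again assumes the hypothetical density $f(x) = 2x$ on $[0, 1]$, for which $\text{E}[X] = 2/3$ and $\text{Var}(X) = 1/2 - 4/9 = 1/18$. The midpoint-rule helper is an illustrative choice, not a recommended integrator.

```python
def f(x):
    """Hypothetical density for illustration: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(g, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))

# E[X] = integral of x f(x); E[X^2] = integral of x^2 f(x).
mu = integrate(lambda x: x * f(x), 0.0, 1.0)
second_moment = integrate(lambda x: x**2 * f(x), 0.0, 1.0)

# Var(X) = E[X^2] - mu^2
var = second_moment - mu**2

print(mu, var)
```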
Distribution¶
Law of Large Numbers and Central Limit Theorem¶
Bayes' Theorem¶
Statistics¶
Statistics is the study and practice of collecting and analysing data. It includes descriptive statistics and inferential statistics.
Descriptive Statistics¶
Descriptive statistics focuses on summarizing and describing the dataset itself by numerical and graphical methods, without drawing conclusions or making predictions about a population.
Fundamental Concepts¶

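Measures of Central Tendency¶
- Mean - sum of the values divided by the number of values
- Median - middle value of the sorted data, or the average of the 2 middle values when the number of values is even
- Mode - most frequent value, often used to describe categorical data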
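The three measures can be computed with Python's standard `statistics` module. The small `ages` and `sexes` lists below are made-up data for illustration, not values from the Titanic dataset.

```python
import statistics

# Hypothetical sample with one large value (71) to show its effect on the mean.
ages = [22, 24, 24, 28, 30, 35, 71]

mean_age = statistics.mean(ages)      # sum / size
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

# The mode also works for categorical data, where mean and median are undefined.
sexes = ["male", "female", "male"]
mode_sex = statistics.mode(sexes)

print(mean_age, median_age, mode_age, mode_sex)
```

In this sample the single large value pulls the mean above the median, which is why the median is often preferred for skewed data.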
Measures of Spread¶
Range - difference between the maximum and the minimum (max - min)
Variance - mean((data - mean)^2)
Population variance: $$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$
Sample variance: $$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$
Standard Deviation - sqrt(var), easier to interpret since it is in the same units as the data rather than in squared units
Mean Absolute Deviation - mean(abs(data - mean)) $$ \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| $$
- SD involves squared deviations, which gives more weight to larger deviations, so it's more sensitive to outliers.
- MAD involves absolute deviations, which weights every deviation in proportion to its size, so it's more robust when the data is skewed or non-normal.
- MAD is more robust but SD is more commonly used, especially when assuming a normal distribution.
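The difference between the population and sample formulas, and the sensitivity of SD and MAD to an outlier, can be seen on a toy dataset. This is a sketch with made-up numbers; `mean_abs_dev` is a hypothetical helper, since the `statistics` module has no built-in mean absolute deviation.

```python
import statistics

def mean_abs_dev(data):
    """Mean absolute deviation: mean(|x_i - mean|)."""
    m = statistics.fmean(data)
    return statistics.fmean([abs(x - m) for x in data])

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by n - 1
pop_sd = statistics.pstdev(data)        # sqrt of the population variance
mad = mean_abs_dev(data)

# Add one extreme value and compare how much each measure grows.
with_outlier = data + [40]
sd_ratio = statistics.pstdev(with_outlier) / pop_sd
mad_ratio = mean_abs_dev(with_outlier) / mad

print(pop_var, sample_var, pop_sd, mad)
print(sd_ratio, mad_ratio)
```

Note that pandas' `Series.var()` and `Series.std()` use the sample formula (`ddof=1`) by default, whereas NumPy's `np.var` and `np.std` default to the population formula (`ddof=0`).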
Quartiles, Quantiles, Interquartile Range (IQR)
- Quartiles split up the data into 4 equal parts.
- Quantiles are a generalized version of quartiles, e.g. splitting the data into 5 parts (quintiles), 10 parts (deciles) or 100 parts (percentiles).
- IQR is the difference between Q3 (75th percentile) and Q1 (25th percentile), i.e. IQR = Q3 - Q1.
Outliers (a common rule of thumb)
- Data > Q3 + 1.5 * IQR
- Data < Q1 - 1.5 * IQR
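The outlier rule can be applied with the standard library alone. The sketch below uses made-up data and `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; other quantile conventions give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up data with one extreme value

# Quartiles Q1, Q2 (median), Q3 using linear interpolation between data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

The factor 1.5 is a convention rather than a law; it is the same rule that box plots use to decide which points are drawn individually beyond the whiskers.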

Data Visualization¶
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
plt.rcParams['figure.figsize'] = [4, 3]
titanic = pd.read_csv('datasets/TitanicSurvival.csv')
# https://vincentarelbundock.github.io/Rdatasets/doc/carData/TitanicSurvival.html
test = pd.read_csv('datasets/MASchools.csv').dropna(subset=['salary', 'score8', 'score4'])
# https://vincentarelbundock.github.io/Rdatasets/doc/AER/MASchools.html
# np.mean(titanic['age'])
# np.nanmedian(titanic['age'])
print('mean = ', titanic['age'].mean())
print('median = ', titanic['age'].median())
print('min = ', titanic['age'].min())
print('max = ', titanic['age'].max())
print('var = ', titanic['age'].var())
print('sd = ', titanic['age'].std())
print('mad = ', (abs(titanic['age'] - titanic['age'].mean())).mean())
# find mode
titanic['age'].mode() # return a Series even if only one value returned
titanic['age'].value_counts().head(1)
titanic['sex'].value_counts()
# find quartile and outlier
quartile = titanic['age'].quantile([0, 0.25, 0.5, 0.75, 1])
iqr = quartile[0.75] - quartile[0.25]
titanic.loc[(titanic['age'] > quartile[0.75] + iqr * 1.5) | (titanic['age'] < quartile[0.25] - iqr * 1.5)]
mean = 29.881134512434034
median = 28.0
min = 0.166700006
max = 80.0
var = 207.74897359935744
sd = 14.413499699911796
mad = 11.262330473363223

0    24.0
Name: age, dtype: float64

age
24.0    47
Name: count, dtype: int64

sex
male      843
female    466
Name: count, dtype: int64

| | rownames | survived | sex | age | passengerClass |
|---|---|---|---|---|---|
| 9 | Artagaveytia, Mr. Ramon | no | male | 71.0 | 1st |
| 14 | Barkworth, Mr. Algernon Henry W | yes | male | 80.0 | 1st |
| 61 | Cavendish, Mrs. Tyrell William | yes | female | 76.0 | 1st |
| 81 | Crosby, Capt. Edward Gifford | no | male | 70.0 | 1st |
| 135 | Goldschmidt, Mr. George B | no | male | 71.0 | 1st |
| 285 | Straus, Mr. Isidor | no | male | 67.0 | 1st |
| 506 | Mitchell, Mr. Henry Michael | no | male | 70.0 | 2nd |
| 727 | Connors, Mr. Patrick | no | male | 70.5 | 3rd |
| 1235 | Svensson, Mr. Johan | no | male | 74.0 | 3rd |
# bar plot -> 1 qualitative
array = titanic['survived'].value_counts()
plt.title('Titanic Survival Data')
plt.xlabel('Survived')
plt.ylabel('Number of people')
plt.bar(array.index, array.values, width=0.5);
# semi-colon to omit the text output
# box plot -> 1 quantitative
# note the nan
data = titanic['age'].dropna()
fig, ax = plt.subplots()
# https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots
# plt.subplots() is a function that returns a tuple containing a figure and axes objects.
# plt.subplots() is equal to plt.subplots(1, 1) and plt.subplots(nrows=1, ncols=1)
# fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(nrows=2, ncols=2)
# fig, [ax1, ax2, ax3, ax4] = plt.subplots(nrows=1, ncols=4)
# https://stackoverflow.com/questions/52214776/python-matplotlib-differences-between-subplot-and-subplots
# plt.subplots() VS. plt.subplot()
# plt.subplots() - single function to create a figure with several subplots
# fig, axes = plt.subplots(nrows=2, ncols=3): create 2*3 array of axes objects
# plt.subplot() creates only a single subplot axes at a specified grid position.
# This means it will require several lines of code
# to achieve the same result as plt.subplots() did in a single line of code above:
# fig = plt.figure()
# ax = plt.subplot(231)
# ax = plt.subplot(232)
# ax = plt.subplot(233)
# ax = plt.subplot(234)
# ax = plt.subplot(235)
# ax = plt.subplot(236)
# https://blog.csdn.net/htuhxf/article/details/82986440
ax.boxplot(data)
ax.set_xticklabels(['age']);
# histogram -> 1 quantitative
plt.xlabel('Age')
plt.ylabel('Number of people')
plt.hist(data);
# stacked bar plot -> 2 qualitative
# select 'survived' as a Series so that value_counts() is indexed by 'no'/'yes'
sexarray_f = titanic.loc[titanic['sex'] == 'female', 'survived'].value_counts().sort_index()
sexarray_m = titanic.loc[titanic['sex'] == 'male', 'survived'].value_counts().sort_index()
plt.bar(sexarray_f.index, sexarray_f.values, color='r')
# stack the male counts on top of the female counts
plt.bar(sexarray_m.index, sexarray_m.values, bottom=sexarray_f.values, color='b')
plt.legend(['Female', 'Male'])
plt.ylabel('Number of people')
plt.xlabel('Survived');
# clustered bar plot -> 2 qualitative
classarray_1 = titanic.loc[titanic['passengerClass'] == '1st', ['survived']].value_counts().sort_index()
classarray_2 = titanic.loc[titanic['passengerClass'] == '2nd', ['survived']].value_counts().sort_index()
classarray_3 = titanic.loc[titanic['passengerClass'] == '3rd', ['survived']].value_counts().sort_index()
width = 0.2
x = np.arange(2)
plt.bar(x-width, classarray_1, width, color='r')
plt.bar(x, classarray_2, width, color='b')
plt.bar(x+width, classarray_3, width, color='g')
plt.xticks(x, ['Not Survived', 'Survived'])
plt.legend(["1st class", "2nd class", "3rd class"], title='Passenger Class');
# scatter plot -> 2 quantitative (color encodes a 3rd quantitative variable)
plt.scatter(test['score4'], test['score8'], c=test['salary'])
plt.colorbar();
# double box plot -> 1 quantitative + 1 qualitative
agearray_y = titanic.loc[titanic['survived'] == 'yes', 'age'].dropna()
agearray_n = titanic.loc[titanic['survived'] == 'no', 'age'].dropna()
fig, ax = plt.subplots()
ax.boxplot([agearray_n, agearray_y], tick_labels=['Not Survived', 'Survived'])
ax.set_ylabel('Age')
ax.set_xlabel('Outcomes')
ax.set_title('Titanic Survived Data');
Statistical Inference¶
Reference¶
Marker Reference¶


Line Reference¶

Color Reference¶

